In [166]:
%matplotlib inline
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

from pandas import DataFrame
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.base import TransformerMixin
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import SelectKBest
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import FeatureUnion

In [167]:
data_dict = pickle.load(open("../ud120-projects/final_project/final_project_dataset.pkl", "r") )

Holdout

Since the dataset for this project is so small, a hold-out set will not be used, and only k-fold testing and training splits will be used to measure accuracy.

This is because even with a stratified hold-out set of 20%, with only 146 data points, a great deal of missing data, and 18 POIs, there would be only 3 or so POIs available for a final test. Performance metrics estimated on such a small hold-out set would have very poor precision, while setting the data aside would also hurt the ability to build the model.

"when the number of samples is not large, a strong case can be made that a test set should be avoided because every sample may be needed for model building. (...) Additionally, the size of the test set may not have sufficient power or precision to make reasonable judgements. "

[1] Kuhn M., Johnson K. (2013). Applied Predictive Modeling. Springer. p. 67.

Hawkins et al. (2003) concisely summarize this point: “holdout samples of tolerable size [...] do not match the cross-validation itself for reliability in assessing model fit and are hard to motivate.”

[2] Hawkins D., Basak S., Mills D. (2003). “Assessing Model Fit by Cross-Validation.” Journal of Chemical Information and Computer Sciences, 43(2), 579–586.

This will be addressed with K-fold cross-validation resampling techniques.

Version 2 - Cross Validation Scheme

  1. Define the sets of model parameter values to evaluate
  2. for each parameter set in grid search DO
    1. For each k-fold resampling iteration DO
      1. Hold out 1/K of the samples as the assessment fold
      2. Pre-process the data (fit each transformation on the training folds, then apply the same fitted transformation to the held-out fold)
        1. Impute data (median)
        2. Scale features: (x_i - mean)/std
        3. Perform any univariate feature selection (remove very low variance features)
        4. Model-based feature selection (ExtraTreesClassifier)
      3. Fit the model on the remaining (K-1)/K training folds
      4. Predict the held-out fold
    2. END
    3. Calculate the average performance across hold-out predictions
  3. END
  4. Determine the optimal parameter set
  5. Fit the final model to all training data using the optimal parameter set
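The scheme above maps onto the scikit-learn classes imported earlier. The following is only a minimal sketch: the step names, parameter values, and number of folds are illustrative placeholders, not the final choices made later in this notebook.

In [ ]:
# Sketch of the scheme above: every pre-processing step lives inside the Pipeline,
# so imputation/scaling/selection are re-fit on the training folds only.
cv_pipe = Pipeline(steps=[
    ('impute', Imputer(strategy='median')),           # impute (median)
    ('scale', StandardScaler()),                       # scale: (x_i - mean)/std
    ('low_var', VarianceThreshold(threshold=0.0)),     # univariate filter
    ('clf', ExtraTreesClassifier(n_estimators=100))    # model / model-based selection
])

param_grid = {'clf__min_samples_split': [2, 4, 10]}    # the parameter sets to evaluate (step 1)

# GridSearchCV handles the fold loop, the averaging across held-out folds, and the
# final refit on all training data with the optimal parameter set:
# grid = GridSearchCV(cv_pipe, param_grid, cv=StratifiedKFold(y, n_folds=5), scoring='f1')
# grid.fit(X, y)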

In [168]:
df = pd.DataFrame.from_dict(data_dict, orient='index')

'NaN' was imported as a string instead of a missing value. We will convert these to NaN type and look at how many missing values the data has.


In [169]:
# Replace 'NaN' strings with actual np.nan values
df = df.replace('NaN', np.nan)
# Replace email strings with True/False boolean as to whether an email was present or not
df['email_address'] = df['email_address'].fillna(0).apply(lambda x: x != 0, 1)

In [170]:
# Replace with index watcher
# A quick look at the original financial spreadsheet shows a TOTAL row at the bottom
# summing every column across all people. This is obviously an outlier with no
# meaningful information and can be removed.

# df[df['salary'] > 1000000]
# df[df.index == 'TOTAL']
df = df.drop('TOTAL', axis=0)

In [170]:

By default, GridSearchCV uses 3-fold cross-validation. However, if it detects that a classifier (rather than a regressor) is passed, it uses a stratified 3-fold.

http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
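For clarity, the CV object can also be passed in explicitly. A rough sketch with the older cross_validation API used in this notebook (clf and param_grid are placeholders here):

In [ ]:
# Sketch: explicit equivalent of the default behaviour for classifiers.
# skf = StratifiedKFold(df['poi'], n_folds=3)
# grid_search = GridSearchCV(clf, param_grid, cv=skf, scoring='f1')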

Remove columns with less than 50% of entries present.

Remove rows with no non-NA values


In [171]:
# low_var_remover = VarianceThreshold(threshold=.5)

In [172]:
# ************************
# Encode as 0 instead.
# Remove columns with more than 50% NA's
# df_50 = df.dropna(axis=1, thresh=len(df)/2)
# ************************

# Since email_address and poi are True/False, every record should have at least 2 non-NA.
# We'll next remove any rows that don't have at least 2 non-NA values besides these.
# The criterion is: no more than 11 NA's per row.
# df_50 = df_50.dropna(axis=0, thresh=5)

# 128 records remain.
# df_50.info()

Financial NA's

When looking at the source of the data, the NA entries in the financial data appear to be values reported as zero, since the payment/stock components add up to the total payments/stock values. These NA values should therefore be set to 0 so that the components still add up to the totals reported in the accounting spreadsheet.
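As a rough sanity check of that assumption, the payment components can be summed and compared against total_payments. This is only a sketch: the component list is typed out by hand, and any data-entry errors in the source spreadsheet will show up as nonzero differences.

In [ ]:
# Sketch: with NA's treated as 0, the payment components should add up to total_payments.
# pay_parts = ['salary', 'deferral_payments', 'bonus', 'expenses', 'loan_advances',
#              'other', 'director_fees', 'deferred_income', 'long_term_incentive']
# diff = df[pay_parts].fillna(0).sum(axis=1) - df['total_payments'].fillna(0)
# print diff.abs().max()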

Email statistics NA's

How to treat the NA's in the email statistics is a little more subjective.

  1. Some email statistics are features created with prior knowledge of the entire dataset (i.e. emails to/from POIs). This borders on data snooping: if new people were somehow added to the dataset, these features could not be generated for them without already knowing which of the new people are POIs.

  2. NA's here imply that the person either did not have an email account with Enron or was otherwise not part of the email corpus.

The email features are missing together: if one email column is NA for a person, they all are. It is hard to judge what distribution these values would have had if the person had been given an email account, since the email statistics have no ties to the financial data from which a distribution could be inferred.

We have no real way to infer whether a person sent/received 10 emails or 10,000 from completely unrelated financial data, drawn from a different dataset covering many different people.

For this reason, these NA's will also be encoded as 0.


In [173]:
df = df.apply(lambda x: x.fillna(0), axis=0)

In [173]:

Imputation


In [174]:
import seaborn as sns
sns.set(style='darkgrid')

f, ax = plt.subplots(figsize=(14, 14))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.corrplot(df.corr(), annot=True, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()



In [175]:
# Pick a column which we are predicting.
# Find other variables correlated with it and use KNeighborsRegressor to predict/impute
# the missing values.
# df_50.corr().ix[: ,'salary']

In [176]:
# cols1 = ['salary', 'other', 'total_stock_value', 'exercised_stock_options', 
#        'total_payments', 'restricted_stock']
# Bonus and salary values don't seem to be missing at random. Anytime there is a null value
# for salary, there is also one for bonus. So bonus can't be used to predict salary on
# the first pass. Predicted salary values will be used to predict bonus values though 
# on a second pass.
# cols2= ['salary', 'other', 'total_stock_value', 'exercised_stock_options', 
#        'total_payments', 'restricted_stock', 'bonus']
# cols3 = ['to_messages', 'from_this_person_to_poi', 'from_messages', 
# 'shared_receipt_with_poi', 'from_poi_to_this_person']

In [177]:
def kcluster_null(df=None, cols=None, process_all=True):
    '''
    Input: a pandas dataframe with values to impute, and a list of columns both to impute
        and to use as predictors.
    Returns: a pandas dataframe with null values imputed for the columns passed in.

    Ideally the columns should be somewhat correlated, since they are used in KNN to
    predict each other, one column at a time.
    
    '''
    
    # Create a KNN regression estimator for the imputation.
    income_imputer = KNeighborsRegressor(n_neighbors=1)
    # Loop through the columns passed in and impute each one sequentially.

    # NOTE: the process_all=False branch below is currently unused by the loop that follows.
    if not process_all:
        to_pred = cols[0]
        predictor_cols = cols[1:]
        
        
    for each in cols:
        # Create a temp list that does not include the column being predicted.
        temp_cols = [col for col in cols if col != each]
        # Create a dataframe that contains no missing values in the columns being predicted.
        # This will be used to train the KNN estimator.
        df_col = df[df[each].isnull()==False]
        
        # Create a dataframe with all of the nulls in the column being predicted.
        df_null_col = df[df[each].isnull()==True]
        
        # Create a temp dataframe filling in the medians for each column being used to
        # predict that is missing values.
        # This step is needed since we have so many missing values distributed through 
        # all of the columns.
        temp_df_medians = df_col[temp_cols].apply(lambda x: x.fillna(x.median()), axis=0)
        
        # Fit our KNN imputer to this dataframe now that we have values for every column.
        income_imputer.fit(temp_df_medians, df_col[each])
        
        # Fill the df (that has null values being predicted) with medians in the other
        # columns not being predicted.
        # ** This currently uses its own medians and should ideally use the predictor df's
        # ** median values to fill in NA's of columns being used to predict.
        temp_null_medians = df_null_col[temp_cols].apply(lambda x: x.fillna(x.median()), axis=0)
        
        # Predict the null values for the current 'each' variable.
        new_values = income_imputer.predict(temp_null_medians[temp_cols])

        # Replace the null values of the original null dataframe with the predicted values.
        df_null_col[each] = new_values
        
        # Append the newly predicted rows back to the dataframe which contained
        # no null values.
        # Overwrite the original df with this one containing predicted columns. 
        # Index order will not be preserved since it is rearranging each time by 
        # null values.
        df = df_col.append(df_null_col)
        
    # Return the final dataframe sorted by the index names.
    return df.sort_index(axis=0)
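A usage sketch for the helper above, using the correlated column groups from the commented-out cell earlier. It is illustrative only: the notebook ultimately encodes the missing values as 0 instead, so by this point there are no nulls left for the imputer to fill.

In [ ]:
# Illustrative only -- NA's were already encoded as 0 above, so this is not run here.
# cols1/cols2 are the column groups listed in the commented-out cell earlier.
# df_imputed = kcluster_null(df, cols=cols1)
# df_imputed = kcluster_null(df_imputed, cols=cols2)   # second pass, now including bonus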

In [177]:


In [178]:
df.irow(127)


Out[178]:
salary                            0
to_messages                       0
deferral_payments                 0
total_payments               362096
exercised_stock_options           0
bonus                             0
restricted_stock                  0
shared_receipt_with_poi           0
restricted_stock_deferred         0
total_stock_value                 0
expenses                          0
loan_advances                     0
from_messages                     0
other                        362096
from_this_person_to_poi           0
poi                           False
director_fees                     0
deferred_income                   0
long_term_incentive               0
email_address                 False
from_poi_to_this_person           0
Name: THE TRAVEL AGENCY IN THE PARK, dtype: object

In [179]:
#cols = [x for x in df.columns]
#for each in cols:
#    g = sns.FacetGrid(df, col='poi', margin_titles=True, size=6)
#    g.map(plt.hist, each, color='steelblue')

In [180]:
from pandas.tools.plotting import scatter_matrix

In [181]:
list(df.columns)


Out[181]:
['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'email_address',
 'from_poi_to_this_person']

In [182]:
financial_cols = np.array(['salary', 'deferral_payments', 'total_payments', 'exercised_stock_options', 
                  'bonus', 'restricted_stock', 'restricted_stock_deferred', 'total_stock_value',
                  'expenses', 'loan_advances', 'other', 'director_fees', 'deferred_income', 
                  'long_term_incentive'])

email_cols = np.array(['from_messages', 'to_messages', 'shared_receipt_with_poi', 
              'from_this_person_to_poi', 'from_poi_to_this_person', 'email_address'])

In [183]:
from sklearn.ensemble import RandomForestClassifier

In [184]:
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(df[financial_cols], df['poi'])


Out[184]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [185]:
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

In [186]:
padding = np.arange(len(financial_cols)) + 0.5
plt.figure(figsize=(14, 12))
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, financial_cols[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()



In [187]:
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(df[email_cols], df['poi'])

importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(email_cols)) + 0.5
plt.figure(figsize=(14, 12))
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, email_cols[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()



In [188]:
all_cols = np.concatenate([email_cols, financial_cols])
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(df[all_cols], df['poi'])

importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(all_cols)) + 0.5
plt.figure(figsize=(14, 12))
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, all_cols[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()



In [189]:
df['ex_stock_bins'] = pd.cut(df.exercised_stock_options, bins=15, labels=False)
pd.value_counts(df.ex_stock_bins)


Out[189]:
0     118
1      11
2       6
3       4
8       2
14      1
13      1
6       1
4       1
dtype: int64

In [190]:
df.exercised_stock_options.plot()


Out[190]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ecc1ac8>

In [191]:
def capValues(x, cap):
    return (cap if x > cap else x)

In [192]:
df.exercised_stock_options = df.exercised_stock_options.apply(lambda x: capValues(x, 5000000))
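An equivalent pandas one-liner, as a sketch; for a numeric column this should behave the same as capValues.

In [ ]:
# Sketch: pandas can cap a numeric Series directly.
# df.exercised_stock_options = df.exercised_stock_options.clip(upper=5000000)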

In [193]:
df['ex_stock_bins'] = pd.cut(df.exercised_stock_options, bins=15, labels=False)
pd.value_counts(df.ex_stock_bins)


Out[193]:
0     60
1     18
14    16
2     13
4     12
6      7
3      5
5      4
12     3
7      3
13     2
9      2
dtype: int64

In [194]:
df[['ex_stock_bins', 'poi']].groupby('ex_stock_bins').mean().plot()


Out[194]:
<matplotlib.axes._subplots.AxesSubplot at 0x179937f0>

In [195]:
df.columns


Out[195]:
Index([u'salary', u'to_messages', u'deferral_payments', u'total_payments', u'exercised_stock_options', u'bonus', u'restricted_stock', u'shared_receipt_with_poi', u'restricted_stock_deferred', u'total_stock_value', u'expenses', u'loan_advances', u'from_messages', u'other', u'from_this_person_to_poi', u'poi', u'director_fees', u'deferred_income', u'long_term_incentive', u'email_address', u'from_poi_to_this_person', u'ex_stock_bins'], dtype='object')

In [196]:
df[['bonus', 'poi']].groupby('bonus').mean().plot()


Out[196]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caeb780>

In [197]:
df.shared_receipt_with_poi.plot()


Out[197]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b1ce9e8>

In [198]:
max(df.shared_receipt_with_poi)


Out[198]:
5521.0

In [199]:
# Create bins for shared receipt with poi
my_bins = [min(df.shared_receipt_with_poi)] + [250] + range(500, 5000, 500) + [max(df.shared_receipt_with_poi)]
df['shared_poi_bins'] = pd.cut(df.shared_receipt_with_poi, bins=my_bins, labels=False, include_lowest=True)
pd.value_counts(df['shared_poi_bins'])


Out[199]:
0     81
2     19
5     11
3      9
1      9
4      6
8      4
6      4
10     2
dtype: int64

In [199]:


In [200]:
df[['shared_poi_bins', 'poi']].groupby('shared_poi_bins').mean().plot()


Out[200]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ed2e10>

In [201]:
df.total_stock_value


Out[201]:
ALLEN PHILLIP K          1729541
BADUM JAMES P             257817
BANNANTINE JAMES M       5243487
BAXTER JOHN C           10623258
BAY FRANKLIN R             63014
BAZELIDES PHILIP J       1599641
BECK SALLY W              126027
BELDEN TIMOTHY N         1110705
BELFER ROBERT             -44093
BERBERIAN DAVID          2493616
BERGSIEKER RICHARD P      659249
BHATNAGAR SANJAY               0
BIBI PHILIPPE A          1843816
BLACHMAN JEREMY M         954354
BLAKE JR. NORMAN P             0
...
UMANOFF ADAM S                  0
URQUHART JOHN A                 0
WAKEHAM JOHN                    0
WALLS JR ROBERT H         5898997
WALTERS GARETH W          1030329
WASAFF GEORGE             2056427
WESTFAHL RICHARD K         384930
WHALEY DAVID A              98718
WHALLEY LAWRENCE G        6079137
WHITE JR THOMAS E        15144123
WINOKUR JR. HERBERT S           0
WODRASKA JOHN                   0
WROBEL BRUCE               139130
YEAGER F SCOTT           11884758
YEAP SOON                  192758
Name: total_stock_value, Length: 145, dtype: float64

In [201]:


In [202]:
from sklearn.preprocessing import StandardScaler

df['total_stock_scaled'] = StandardScaler().fit_transform(df[['total_stock_value']])
df['bonus_scaled'] = StandardScaler().fit_transform(df[['bonus']])

print df.total_stock_scaled.describe()
plt.hist(df.total_stock_scaled)


In [203]:
def dont_neg_log(x):
    if x >=0:
        return np.log1p(x)
    else:
        return 0
    
df['stock_log'] = df['total_stock_value'].apply(lambda x: dont_neg_log(x))

Feature Ratio Creation


In [204]:
financial_cols = np.array(['salary', 'deferral_payments', 'total_payments', 'exercised_stock_options', 
                  'bonus', 'restricted_stock', 'restricted_stock_deferred', 'total_stock_value',
                  'expenses', 'loan_advances', 'other', 'director_fees', 'deferred_income', 
                  'long_term_incentive'])

email_cols = np.array(['from_messages', 'to_messages', 'shared_receipt_with_poi', 
              'from_this_person_to_poi', 'from_poi_to_this_person', 'email_address'])

In [205]:
payment_comp = ['salary', 'deferral_payments','bonus', 'expenses', 'loan_advances',
                'other', 'director_fees', 'deferred_income', 'long_term_incentive']
payment_total = ['total_payments']

stock_comp = ['exercised_stock_options', 'restricted_stock','restricted_stock_deferred',]
stock_total = ['total_stock_value']

all_comp = payment_comp + stock_comp

email_comp = ['shared_receipt_with_poi', 'from_this_person_to_poi', 'from_poi_to_this_person' ]
email_totals = ['from_messages', 'to_messages'] # interaction_w_poi = total(from/to/shared poi)

In [205]:


In [206]:
df['total_compensation'] = df['total_payments'] + df['total_stock_value']

for each in payment_comp:
    df['{0}_{1}_ratio'.format(each, 'total_pay')] = df[each]/df['total_payments']

for each in stock_comp:
    df['{0}_{1}_ratio'.format(each, 'total_stock')] = df[each]/df['total_stock_value']

for each in all_comp:
    df['{0}_{1}_ratio'.format(each, 'total_compensation')] = df[each]/df['total_compensation']
    
    
df['total_poi_interaction'] = df['shared_receipt_with_poi'] + df['from_this_person_to_poi'] + \
df['from_poi_to_this_person']

for each in email_comp:
    df['{0}_{1}_ratio'.format(each, 'total_poi_int')] = df[each]/df['total_poi_interaction']

df['total_active_poi_interaction'] = df['from_this_person_to_poi'] + df['from_poi_to_this_person']
df['to_poi_total_active_poi_ratio'] = df['from_this_person_to_poi']/df['total_active_poi_interaction']
df['from_poi_total_active_poi_ratio'] = df['from_poi_to_this_person']/df['total_active_poi_interaction']

df['to_messages_to_poi_ratio'] = df['from_this_person_to_poi']/ df['to_messages']
df['from_messages_from_poi_ratio'] = df['from_poi_to_this_person']/df['from_messages']
df['shared_poi_from_messages_ratio'] = df['shared_receipt_with_poi']/df['from_messages']

A good portion of people were paid only in stock or only in payments, and another good portion have no email statistics at all.

The resulting divisions by zero produce NaN (for 0/0) and inf (for nonzero/0), so these ratio values need to be set to zero manually.


In [207]:
df = df.apply(lambda x: x.fillna(0), axis=0)
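Note that fillna only covers the 0/0 cases: a nonzero numerator over a zero denominator produces inf rather than NaN, and scikit-learn rejects infinite values just as it rejects NaN. A minimal sketch of clearing those as well:

In [ ]:
# Sketch: replace the +/-inf values produced by nonzero/0 divisions too.
# df = df.replace([np.inf, -np.inf], 0)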

In [209]:
df[['poi', 'from_messages', 'to_messages', 'shared_receipt_with_poi','total_active_poi_interaction']]


Out[209]:
poi from_messages to_messages shared_receipt_with_poi total_active_poi_interaction
ALLEN PHILLIP K False 2195 2902 1407 112
BADUM JAMES P False 0 0 0 0
BANNANTINE JAMES M False 29 566 465 39
BAXTER JOHN C False 0 0 0 0
BAY FRANKLIN R False 0 0 0 0
BAZELIDES PHILIP J False 0 0 0 0
BECK SALLY W False 4343 7315 2639 530
BELDEN TIMOTHY N True 484 7991 5521 336
BELFER ROBERT False 0 0 0 0
BERBERIAN DAVID False 0 0 0 0
BERGSIEKER RICHARD P False 59 383 233 4
BHATNAGAR SANJAY False 29 523 463 1
BIBI PHILIPPE A False 40 1607 1336 31
BLACHMAN JEREMY M False 14 2475 2326 27
BLAKE JR. NORMAN P False 0 0 0 0
BOWEN JR RAYMOND M True 27 1858 1593 155
BROWN MICHAEL False 41 1486 761 14
BUCHANAN HAROLD G False 125 1088 23 0
BUTTS ROBERT H False 0 0 0 0
BUY RICHARD B False 1053 3523 2333 227
CALGER CHRISTOPHER F True 144 2598 2188 224
CARTER REBECCA C False 15 312 196 36
CAUSEY RICHARD A True 49 1892 1585 70
CHAN RONNIE False 0 0 0 0
CHRISTODOULOU DIOMEDES False 0 0 0 0
CLINE KENNETH W False 0 0 0 0
COLWELL WESLEY True 40 1758 1132 251
CORDES WILLIAM R False 12 764 58 10
COX DAVID False 33 102 71 4
CUMBERLAND MICHAEL S False 0 0 0 0
... ... ... ... ... ...
SCRIMSHAW MATTHEW False 0 0 0 0
SHANKMAN JEFFREY A False 2681 3221 1730 177
SHAPIRO RICHARD S False 1215 15149 4527 139
SHARP VICTORIA T False 136 3136 2477 30
SHELBY REX True 39 225 91 27
SHERRICK JEFFREY B False 25 613 583 57
SHERRIFF JOHN R False 92 3187 2103 51
SKILLING JEFFREY K True 108 3627 2042 118
STABLER FRANK False 0 0 0 0
SULLIVAN-SHAKLOVITZ COLLEEN False 0 0 0 0
SUNDE MARTIN False 38 2647 2565 50
TAYLOR MITCHELL S False 29 533 300 0
THE TRAVEL AGENCY IN THE PARK False 0 0 0 0
THORN TERENCE H False 41 266 73 0
TILNEY ELIZABETH A False 19 460 379 21
UMANOFF ADAM S False 18 111 41 12
URQUHART JOHN A False 0 0 0 0
WAKEHAM JOHN False 0 0 0 0
WALLS JR ROBERT H False 146 671 215 17
WALTERS GARETH W False 0 0 0 0
WASAFF GEORGE False 30 400 337 29
WESTFAHL RICHARD K False 0 0 0 0
WHALEY DAVID A False 0 0 0 0
WHALLEY LAWRENCE G False 556 6019 3920 210
WHITE JR THOMAS E False 0 0 0 0
WINOKUR JR. HERBERT S False 0 0 0 0
WODRASKA JOHN False 0 0 0 0
WROBEL BRUCE False 0 0 0 0
YEAGER F SCOTT True 0 0 0 0
YEAP SOON False 0 0 0 0

145 rows × 5 columns


In [210]:
df[df['poi']==True]


Out[210]:
salary to_messages deferral_payments total_payments exercised_stock_options bonus restricted_stock shared_receipt_with_poi restricted_stock_deferred total_stock_value ... total_poi_interaction shared_receipt_with_poi_total_poi_int_ratio from_this_person_to_poi_total_poi_int_ratio from_poi_to_this_person_total_poi_int_ratio total_active_poi_interaction to_poi_total_active_poi_ratio from_poi_total_active_poi_ratio to_messages_to_poi_ratio from_messages_from_poi_ratio shared_poi_from_messages_ratio
BELDEN TIMOTHY N 213999 7991 2144013 5501630 953136 5249999 157569 5521 0 1110705 ... 5857 0.942633 0.018439 0.038928 336 0.321429 0.678571 0.013515 0.471074 11.407025
BOWEN JR RAYMOND M 278601 1858 0 2669589 0 1350000 252055 1593 0 252055 ... 1748 0.911327 0.008581 0.080092 155 0.096774 0.903226 0.008073 5.185185 59.000000
CALGER CHRISTOPHER F 240189 2598 0 1639297 0 1250000 126027 2188 0 126027 ... 2412 0.907131 0.010365 0.082504 224 0.111607 0.888393 0.009623 1.381944 15.194444
CAUSEY RICHARD A 415189 1892 0 1868758 0 1000000 2502063 1585 0 2502063 ... 1655 0.957704 0.007251 0.035045 70 0.171429 0.828571 0.006342 1.183673 32.346939
COLWELL WESLEY 288542 1758 27610 1490344 0 1200000 698242 1132 0 698242 ... 1383 0.818510 0.007954 0.173536 251 0.043825 0.956175 0.006257 6.000000 28.300000
DELAINEY DAVID W 365163 3093 0 4747979 2291113 3000000 1323148 2097 0 3614261 ... 2772 0.756494 0.219697 0.023810 675 0.902222 0.097778 0.196896 0.021505 0.683284
FASTOW ANDREW S 440698 0 0 2424083 0 1300000 1794412 0 0 1794412 ... 0 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 0.000000 0.000000
GLISAN JR BEN F 274975 873 0 1272284 384728 600000 393818 874 0 778546 ... 932 0.937768 0.006438 0.055794 58 0.103448 0.896552 0.006873 3.250000 54.625000
HANNON KEVIN P 243293 1045 0 288682 5000000 1500000 853064 1035 0 6391065 ... 1088 0.951287 0.019301 0.029412 53 0.396226 0.603774 0.020096 1.000000 32.343750
HIRKO JOSEPH 0 0 10259 91093 5000000 0 0 0 0 30766064 ... 0 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 0.000000 0.000000
KOENIG MARK E 309946 2374 0 1587421 671737 700000 1248318 2271 0 1920055 ... 2339 0.970928 0.006413 0.022659 68 0.220588 0.779412 0.006318 0.868852 37.229508
KOPPER MICHAEL J 224305 0 0 2652612 0 800000 985032 0 0 985032 ... 0 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 0.000000 0.000000
LAY KENNETH L 1072321 4273 202911 103559793 5000000 7000000 14761694 2411 0 49110078 ... 2550 0.945490 0.006275 0.048235 139 0.115108 0.884892 0.003744 3.416667 66.972222
RICE KENNETH D 420636 905 0 505050 5000000 1750000 2748364 864 0 22542539 ... 910 0.949451 0.004396 0.046154 46 0.086957 0.913043 0.004420 2.333333 48.000000
RIEKER PAULA H 249201 1328 214678 1099100 1635238 700000 283649 1258 0 1918887 ... 1341 0.938106 0.035794 0.026100 83 0.578313 0.421687 0.036145 0.426829 15.341463
SHELBY REX 211844 225 0 2003885 1624396 200000 869220 91 0 2493616 ... 118 0.771186 0.118644 0.110169 27 0.518519 0.481481 0.062222 0.333333 2.333333
SKILLING JEFFREY K 1111258 3627 0 8682716 5000000 5600000 6843672 2042 0 26093672 ... 2160 0.945370 0.013889 0.040741 118 0.254237 0.745763 0.008271 0.814815 18.907407
YEAGER F SCOTT 158403 0 0 360300 5000000 0 3576206 0 0 11884758 ... 0 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 0.000000 0.000000

18 rows × 61 columns

director_fees_total_pay_ratio, deferred_income_total_pay_ratio, exercised_stock_options_total_stock_ratio, exercised_stock_options_total_stock_ratio, restricted_stock_deferred_total_stock_ratio, restricted_stock_total_stock_ratio, director_fees_total_compensation_ratio, deferred_income_total_compensation_ratio, restricted_stock_total_compensation_ratio, restricted_stock_deferred_total_compensation_ratio


In [238]:
# Column slicing by number
df.ix[:,5:10].describe()


Out[238]:
bonus restricted_stock shared_receipt_with_poi restricted_stock_deferred total_stock_value
count 145.000000 145.000000 145.000000 145.000000 145.000000
mean 671335.303448 862546.386207 697.765517 72911.572414 2889718.124138
std 1230147.632511 2010852.212383 1075.128126 1297469.064327 6172223.035654
min 0.000000 -2604490.000000 0.000000 -1787380.000000 -44093.000000
25% 0.000000 0.000000 0.000000 0.000000 221141.000000
50% 300000.000000 360528.000000 114.000000 0.000000 955873.000000
75% 800000.000000 698920.000000 900.000000 0.000000 2282768.000000
max 8000000.000000 14761694.000000 5521.000000 15456290.000000 49110078.000000

In [212]:
#all_cols2 = np.concatenate([all_cols, np.array(['shared_poi_bins', 'ex_stock_bins', 
#                                                'total_stock_scaled', 'bonus_scaled',
#                                                'stock_log'])])

features = np.array(df.drop('poi', axis=1).columns)

clf = ExtraTreesClassifier(n_estimators=2000)
clf.fit(df[features], df['poi'])

importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(features)) + 0.5
plt.figure(figsize=(14,12))
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-212-297fbf6db390> in <module>()
      6 
      7 clf = ExtraTreesClassifier(n_estimators=2000)
----> 8 clf.fit(df[features], df['poi'])
      9 
     10 importances = clf.feature_importances_

c:\Anaconda\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
    222 
    223         # Convert data
--> 224         X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
    225 
    226         # Remap output

c:\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in check_arrays(*arrays, **options)
    281                     array = np.asarray(array, dtype=dtype)
    282                 if not allow_nans:
--> 283                     _assert_all_finite(array)
    284 
    285             if not allow_nd and array.ndim >= 3:

c:\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in _assert_all_finite(X)
     41             and not np.isfinite(X).all()):
     42         raise ValueError("Input contains NaN, infinity"
---> 43                          " or a value too large for %r." % X.dtype)
     44 
     45 

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

In [91]:
confusion_matrix(df['poi'], clf.predict(df))


Out[91]:
array([[127,   0],
       [  0,  18]])

In [43]:
# NOTE: all_cols2 is only defined in the commented-out cell above (from an earlier run).
X_df = df[all_cols2]
y_df = df['poi']

Train


In [44]:
FINANCIAL_FIELDS = ['salary', 'deferral_payments', 'total_payments', 'exercised_stock_options', 
                  'bonus', 'restricted_stock', 'restricted_stock_deferred', 'total_stock_value',
                  'expenses', 'loan_advances', 'other', 'director_fees', 'deferred_income', 
                  'long_term_incentive', 'ex_stock_bins', 'stock_log']

EMAIL_FIELDS = ['from_messages', 'to_messages', 'shared_receipt_with_poi', 
              'from_this_person_to_poi', 'from_poi_to_this_person', 'email_address',
              'shared_poi_bins']

In [56]:
class ColumnExtractor(TransformerMixin):
    '''
    Column extractor transformer for sklearn pipelines.
    Inherits fit_transform() from TransformerMixin, but this is explicitly
    defined here for clarity.
    
    Methods to extract pandas dataframe columns are defined for this class.
    
    '''
    def __init__(self, columns=[]):
        self.columns = columns
    
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
    
    def transform(self, X, **transform_params):
        '''
        Input: A pandas dataframe and a list of column names to extract.
        Output: A pandas dataframe containing only the columns of the names passed in.
        '''
        return X[self.columns]
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def get_params(self, deep=True):
        """Get parameters for this estimator.
        Parameters
        ----------
        deep: boolean, optional
            If True, will return the parameters for this estimator and
            contained subobjects that are estimators.
        Returns
        -------
        params : mapping of string to any
            Parameter names mapped to their values.
        """

        return {'columns': self.columns}
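A quick usage sketch of the transformer (FINANCIAL_FIELDS as defined above; the printed shape is just for inspection):

In [ ]:
# Sketch: ColumnExtractor just slices the dataframe down to the named columns, so it
# can sit at the front of each branch of a FeatureUnion (as in the pipeline further down).
# extractor = ColumnExtractor(columns=FINANCIAL_FIELDS)
# print extractor.fit_transform(df).shape   # -> (n_rows, len(FINANCIAL_FIELDS))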

In [87]:
X_df = df[['total_payments', 'total_stock_value', 'shared_receipt_with_poi', 'bonus']]
y_df = df['poi']

from sklearn.svm import LinearSVC
sk_fold = StratifiedShuffleSplit(y_df, n_iter=10, test_size=0.2) 
        
pipeline = Pipeline(steps=[#('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)),
                           ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)),
                           ('low_var_remover', VarianceThreshold(threshold=0.1)), 
                           #('feature_selection', LinearSVC()),
                           ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
                                                       criterion='gini', n_estimators=1500, n_jobs=1,
                                                       oob_score=True, random_state=None, verbose=0,
                                                       max_features='auto', min_samples_split=2,
                                                       min_samples_leaf=1))])
    
params = {'ET__n_estimators': [1500],
          'ET__max_features': ['auto', None],
          'ET__min_samples_split': [2, 4, 10],
          'ET__min_samples_leaf': [1, 2, 5],
          #'feature_selection__C': [0.1, 1, 10],
          #'ET__criterion' : ['gini', 'entropy'],
          #'imputer__strategy': ['median', 'mean'],
          # NOTE: to actually vary the VarianceThreshold cutoff this key should be
          # 'low_var_remover__threshold'; as written it does not reach the step's threshold.
          'low_var_remover': [0, 0.1, .25, .50, .75]
          }
    
grid_search = GridSearchCV(pipeline, param_grid=params, cv=sk_fold, n_jobs = -1, scoring='f1')
grid_search.fit(X_df, y=y_df)
#test_pred = grid_search.predict(X_test)
#print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
print "Best Estimator: ", grid_search.best_estimator_
    #f1_avg.append(f1_score(y_test, test_pred))
#print "F1: ", f1_score(y_test, test_pred)
#print "Confusion Matrix: "
#print confusion_matrix(y_test, test_pred)
#print "Accuracy Score: ", accuracy_score(y_test, test_pred)
print "Best Params: ", grid_search.best_params_


Best Estimator:  Pipeline(steps=[('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
           criterion='gini', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
           min_samples_split=4, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
Best Params:  {'ET__n_estimators': 1500, 'ET__min_samples_split': 4, 'low_var_remover': 0.25, 'ET__max_features': None, 'ET__min_samples_leaf': 1}

In [88]:
n_iter = 100
sk_fold = StratifiedShuffleSplit(y_df, n_iter=n_iter, test_size=0.1)
f1_avg = []
for train_index, test_index in sk_fold:
    X_train, X_test = X_df.irow(train_index), X_df.irow(test_index)
    y_train, y_test = y_df[train_index], y_df[test_index]

    grid_search.best_estimator_.fit(X_train, y=y_train)
    # pipeline.fit(X_train, y=y_train)
    test_pred = grid_search.predict(X_test)
    #test_pred = pipeline.predict(X_test)

    #print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
    #print "Best Estimator: ", grid_search.best_estimator_
    #print f1_score(y_test, test_pred)
    f1_avg.append(f1_score(y_test, test_pred))
print sum(f1_avg)/n_iter


0.236333333333

In [ ]:
pipeline = Pipeline(steps=[#('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)),
                           #('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)),
                           #('low_var_remover', VarianceThreshold(threshold=0.1)), 
                           #('feature_selection', LinearSVC()),
                           ('features', FeatureUnion([
                                ('financial', Pipeline([
                                    ('extract', ColumnExtractor(FINANCIAL_FIELDS)),
                                    ('scale', StandardScaler()),
                                    ('reduce', LinearSVC())
                                ])),

                                ('email', Pipeline([
                                    ('extract2', ColumnExtractor(EMAIL_FIELDS)),
                                    ('scale2', StandardScaler()),
                                    ('reduce2', LinearSVC())
                                ]))

                            ])),
                           ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
                                                       criterion='gini', n_estimators=1500, n_jobs=1,
                                                       oob_score=True, random_state=None, verbose=0,
                                                       max_features=None, min_samples_split=2,
                                                       min_samples_leaf=1))
                            ])

In [ ]:


In [ ]: